\chapter{Model-free Reinforcement Learning}

In this section we will discuss \textit{model-free} reinforcement learning, in which an agent aims to improve its behavior via interaction with an environment, without explicitly trying to predict state transitions (i.e. dynamics) or rewards. 
By learning a policy directly (possibly along with a value function), the agent can learn to choose good actions without explicitly leveraging model-based control. 


There are several reasons why this model-free approach may be preferable to learning the dynamics of the system. First, the optimal policy may be considerably simpler than the dynamics of the system. An example of this is inventory management, a classic problem in operations research. In this setting, the optimal policy is simply to restock whenever the inventory level is below some threshold, a trivial decision rule. 
% TODO add in inventory example

A second reason is that learning a dynamics model may result in suboptimality of the resulting policy. In dynamics learning, we typically choose a surrogate objective function to optimize the model, such as $L_2$ error. In model-free reinforcement learning, on the other hand, we directly optimize a policy to minimize the expected total cost. Introducing the intermediate objective may induce suboptimality in the policy optimization process. 

We will discuss the two major classes of model-free reinforcement learning algorithms: those based on \textit{temporal difference learning}, such as Q-learning, and those based on \textit{policy gradient} optimization. We will then discuss the intersection of these methods, typically referred to as \textit{actor-critic} methods because they combine an actor (policy, which selects actions) and a critic (value function, which predicts the total cost associated with actions). 

Our discussion in this section will not be comprehensive, and indeed, the modern reinforcement learning literature is growing rapidly. This section is meant to provide an overview of the most common approaches in current use. 

\section{Temporal Difference Learning}

A key concept in reinforcement learning is the idea of \textit{temporal difference} (TD) learning. TD learning combines ideas from statistical inference---in particular, parameter inference with noisy measurements---with dynamic programming. 

Temporal difference learning focuses on learning some flavor of the value function, such as the standard state value function (also referred to as the cost-to-go in these notes), or the state-action value function (also referred to as the Q function). Note, however, that this quantity is not directly observable---the value function can not be measured directly. Thus TD learning turns to the concept of \textit{bootstrapping}, in which the learned value of future states is used to provide a noisy measurement of the current value. In addition to enabling efficient learning of the value function, TD methods also enable \textit{online} control, in which a policy can be immediately updated after observing one transition, as opposed to observing a full episode. 

We will discuss two algorithms: SARSA, which is on-policy, and Q-learning, which is off-policy. We will discuss both of these algorithms in the \textit{tabular} setting. In this setting, we assume our representation of the $Q$ function will take the form of a look up table of each state and action pair. Thus, we necessarily assume a discrete state and action space. We will then discuss combining TD learning with function approximation. This discussion will focus on Q-learning, as the on-policy nature of SARSA has so far been difficult to combine with hyperparametric function approximators such as neural networks. As in our discussion of optimal control in discrete state spaces, we will avoid discussion of the policy evaluation setting, and focus solely on policy improvement (or the control setting). For a thorough discussion of policy evaluation, we refer the reader to \cite{sutton2018reinforcement}.

\subsection{SARSA}

As a first model-free TD learning algorithm, we will consider SARSA. SARSA stands for \textbf{s}tate, \textbf{a}ction, \textbf{r}eward, \textbf{s}tate, \textbf{a}ction, where the last state action pair denote the post-transition state and the action associated with this state. The emphasis on this second action provides a contrast to Q-learning, as we will see later. 

SARSA proceeds as in Algorithm \ref{alg:SARSA}. The agent iteratively interacts with the environment, choosing an action to minimize the learned $Q$ function. This $Q$ function is updated as 
\begin{equation}
    Q(\st_t,\ac_t) \gets Q(\st_t,\ac_t) + \alpha (\cost_t + \gamma Q(\st_{t+1},\ac_{t+1}) - Q(\st_t,\ac_t))
\end{equation}
for $alpha \in (0,1]$. Note that for our discussion of model-free RL, we will use a temporally stationary value function. There are two phases to the SARSA algorithm. One consists of selecting actions to optimize $Q$. For the optimal state-action value function $Q^*$, this yields the optimal policy. 

The second step consists of updating the learned $Q$ function to reduce the \textit{temporal difference error},
\begin{equation}
\cost_t + \gamma Q(\st_{t+1},\ac_{t+1}) - Q(\st_t,\ac_t).
\end{equation}
Note that acting according to the policy for which the $Q$ function is defined, this TD error will be zero in expectation. Thus, at convergence, the policy is optimal for $Q^*$, and the $Q$ function has zero expected error (and is thus at a fixed point) for the associated policy; these two facts together give intuition as to why this approach converges to an optimal policy in the tabular setting. A proof of convergence relies on technical details regarding exploration, which we exclude discussion of for now. 

% TODO add proof of convergence
% can write update as convex combination of previous and Bellman value

% TODO add discussion of step size and stochastic approximation

SARSA is an \textit{on-policy} reinforcement learning algorithm, which means that it must interleave action selection (and thus interacting with the environment) and learning. This is due to the interaction between the greedy (with respect to the $Q$ function) action selection and using the post-transition action ($\ac_{t+1}$) for computing the update to the $Q$ function. Updating the $Q$ function solely with respect to actions from a provided dataset, without controlling the action selection, would learn the policy used in the dataset generation and the associated $Q$ function (a setting referred to as \textit{policy evaluation}). This is in contrast to $Q$-learning (which we will discuss next) which is \textit{off-policy}, and thus able to learn an optimal policy without action selection. 

\subsubsection{$\epsilon$-greedy Exploration}

Reinforcement learning algorithms must focus on \textit{exploration}, as well as \textit{exploitation}. In particular, if greedy action selection is performed, it is possible that some state-action pairs will not be explored, and so the reward from these states is not observed, and thus convergence is not achieved. As opposed to our discussion in the previous section on dual control, we may not assume knowledge of the reward function, or priors on parameters. As a consequence, we must turn to heuristics for efficient exploration. While we will expand on exploration later in this chapter, we will discuss a simple form of exploration typically paired with tabular learning algorithms, \textit{$\epsilon$-greedy} action selection. 

An $\epsilon$-greedy policy assumes some pre-specified $\epsilon \in (0,1)$, and takes the form
\begin{align}
    \pol_{\epsilon}(\st; Q) = \begin{cases}
    \ac_\epsilon \,\, &\text{with probability}\,\, \epsilon\\
    \argmin_{\ac \in \actionspace} \, Q(\st,\ac)\,\, &\text{otherwise}
    \end{cases}
\end{align}
where $\ac_{\epsilon}$ denotes an action sampled uniformly from the action space. Put simply, this policy chooses the optimal action with probability $1-\epsilon$, and otherwise chooses a random action. 

\begin{algorithm}[t]
\caption{SARSA}
\centering
\label{alg:SARSA}

%\begin{minipage}[t]{0.5\textwidth}
\begin{algorithmic}[1]
    \Require Step size $\alpha \in (0,1]$, action selection rule $\pol(\cdot; \cdot)$
    \State Initialize $Q(\st,\ac)$ for all $\st \in \statespace, \ac \in \actionspace$
    \For{each episode, $n = 0, \ldots, N$}
        \State Initialize state $\st_0$
        \State $\ac_t \gets \pol(\st_t; Q)$
        \For{$t = 0, \ldots, T$}
            \State Observe next state $\st_{t+1}$, cost $\cost_{t}$
            \State $\ac_{t+1} \gets \pol(\st_{t+1}; Q)$
            \State $Q(\st_t,\ac_t) \gets Q(\st_t,\ac_t) + \alpha (\cost_t + \gamma Q(\st_{t+1},\ac_{t+1}) - Q(\st_t,\ac_t))$
        \EndFor
    \EndFor
\end{algorithmic}
%\end{minipage}
\end{algorithm}


\subsection{$Q$-Learning}

\begin{algorithm}[t]
\caption{$Q$-learning}
\centering
\label{alg:Q}

%\begin{minipage}[t]{0.5\textwidth}
\begin{algorithmic}[1]
    \Require Step size $\alpha \in (0,1]$, action selection rule $\pol(\cdot; \cdot)$
    \State Initialize $Q(\st,\ac)$ for all $\st \in \statespace, \ac \in \actionspace$
    \For{each episode, $n = 0, \ldots, N$}
        \State Initialize state $\st_0$
        \For{$t = 0, \ldots, T$}
            \State $\ac_t \gets \pol(\st_t; Q)$
            \State Observe next state $\st_{t+1}$, cost $\cost_{t}$
            \State $Q(\st_t,\ac_t) \gets Q(\st_t,\ac_t) + \alpha (\cost_t + \gamma \min_{\ac' \in \actionspace} Q(\st_{t+1},\ac') - Q(\st_t,\ac_t))$
        \EndFor
    \EndFor
\end{algorithmic}
%\end{minipage}
\end{algorithm}

$Q$-learning, named for the state-action value function and detailed in Algorithm \ref{alg:Q}, relies on replacing the on-policy update of $SARSA$ with 
\begin{equation}
    Q(\st_t,\ac_t) \gets Q(\st_t,\ac_t) + \alpha (\cost_t + \gamma \min_{\ac' \in \actionspace} Q(\st_{t+1},\ac') - Q(\st_t,\ac_t)).
\end{equation}
SARSA aimed to perform a greedy action selection step, coupled with a $Q$ function update step that estimated the $Q$ function associated with this greedy action selection (policy evaluation). Recall that
\begin{equation}
    Q^*(\st,\ac) = \cost(\st,\ac) + \E_{\st'}[\J^*(\st')]
\end{equation}
and
\begin{equation}
    \J^*(\st) = \min_{\ac \in \actionspace} Q^*(\st,\ac).
\end{equation}
In contrast to SARSA, $Q$-learning replaces the policy evaluation step with one in which the $Q^*$ is inferred, based on the minimization in the TD error. Interestingly, this results in an off-policy algorithm---for any choice of data-gathering policy, the optimal $Q$ function can be learned (in the tabular case, with sufficient exploration). 

\subsection{$Q$-Learning with Function Approximation}

Up until now, we have considered tabular representations of either policies or $Q$ functions (inducing a tabular policy via maximization over discrete action spaces). For large MDPs, this tabular representation is impractical. For large, discrete MDPs, visiting every state-action pair may be impossible, and we may wish to achieve some notion of \textit{generalization}, in which close states (in some metric) have similar value functions or policies. Moreover, as we move to continuous state spaces, we have two options to represent our $Q$ function. We may either discretize the continuous state space to get a discrete MDP, leaving us susceptible to the curse of dimensionality\footnote{The \textit{curse of dimensionality} is used to describe the exponential growth of the discrete MDP state space as the continuous state space dimension increases.}. Alternatively, we may turn to a parametric representation of our $Q$ function, and optimize these parameters to reduce TD error. As we will see in this section, while these approximation approaches are effective for achieving scalability of $Q$-learning methods, they have poor convergence properties and can be quite unstable in practice. We will focus in this section on combining off-policy methods with function approximation, which have recently been more popular. For a discussion of combining on-policy methods with function approximation, as well as value function approximation, we refer the reader to \cite{sutton2018reinforcement}.

\subsubsection{$Q$-learning with Linear Function Approximation}

Before discussing \textit{nonlinear} approximations such as neural networks, we will discuss combining $Q$-learning with \textit{linear} function approximation. In particular, we will consider functions of the form
\begin{equation}
    Q_{\param}(\st,\ac) = \param^T \feat(\st,\ac)
\end{equation}
where $\feat(\cdot, \cdot)$ is a set of features (or basis functions). Thus, we will replace the tabular update of $Q$-learning with an update rule for the weights, $\param$. We fix as a loss function the squared TD error, defined for a state, action, cost, state tuple $(\st,. \ac, \cost, \st')$ as 
\begin{equation}
    \ell(\param) = \E_{\st,\ac,\st'\sim \rho}\left[ \left(\cost + \gamma \min_{\ac' \in \actionspace} Q_{\param}(\st',\ac') - Q_{\param}(\st,\ac)\right)^2\right]
\end{equation}
which has several convenient properties. Here, $\rho$ is the distribution over states and actions induced by the dynamics of the environment and the policy. Thus, online transitions acting with respect to the current policy (induced by the $Q$ function) will obey this distribution. In general, off-policy learning will not exactly obey this distribution. We will discuss this in the next subsection. For further discussion, we refer the reader to \cite{sutton2018reinforcement}. The gradient of the squared TD error takes the form
\begin{align}
    \nabla_{\param} \ell(\param) &= - \E_{\st,\ac,\st'\sim \rho}\left[  (\cost + \gamma \min_{\ac' \in \actionspace} Q_{\param}(\st',\ac') - Q_{\param}(\st,\ac)) \nabla_{\param} Q_{\param}(\st,\ac) \right]\\
    &= - \E_{\st,\ac,\st'\sim \rho}\left[  (\cost + \gamma \min_{\ac' \in \actionspace} Q_{\param}(\st',\ac') - Q_{\param}(\st,\ac)) \feat(\st,\ac)\right]
\end{align}
where the gradient contribution of the post-transition value is ignored. We will discuss this is more detail when we discuss double $Q$-learning. Note that a Monte Carlo approximation of the expectation over state-action \textit{occupancy}, $\rho$, is provided by online updating as in standard $Q$-learning. This gradient update for a single transition tuple allows us to define approximate version of $Q$-learning and SARSA (if the $\ac'$ taken is used instead of maximizing over the action function). Two approaches are possible: one can update the weights in an online fashion (which we will not discuss here due to this approach being relatively unpopular) or in a batch fashion, averaging the gradient over many transitions, which we will discuss in the next subsection. 

% TODO clarify semi-gradient and missing term

A standard gradient descent update (ignoring the expectation) with step size $\alpha$ for $\param$ takes the form
\begin{equation}
    \param' \gets \param + \alpha (\cost + \gamma \min_{\ac' \in \actionspace} Q_{\param}(\st',\ac') - Q_{\param}(\st,\ac)) \feat(\st,\ac).
\end{equation}
As a special case, consider a discrete state/action space and features corresponding to an indicator function on state-action pairs. This is to say, we consider a $|\statespace|\times|\actionspace|$ dimensional vector where for a given state-action pair, one entry in $\feat$ takes value one and all others are zero. Let $\feat_i$, $\param_i$ denote the nonzero feature entry and the corresponding entry in the parameter vector for state and action $\st, \ac$. Then, the parameter update rule takes the form
\begin{equation}
    \param_i' \gets \param_i + \alpha (\cost + \gamma \min_{\ac' \in \actionspace} Q_{\param}(\st',\ac') - \param_i).
\end{equation}
as $\feat_i = 1$, and $\feat_{j} = 0$ for $j \neq i$. Thus, interpreting the parameters $\param$ as the Q function values for each state-action pair, we have exactly recovered the tabular $Q$-learning update, implying tabular $Q$-learning is a special case of this general parametric update. 

\paragraph{Choice of basis functions.} There are a wide variety of possible choices for $\feat$, as in standard basis function regression. Common choices aim to capture some notion of the the local behavior of $Q$ functions; common choices include tile codings and coarse codings (i.e. indicator functions for regions of state space), or radial basis functions, which have exponentially declining importance further from their centers. Alternatively, for expressive function representations, common choices include polynomials and Fourier basis functions. As in standard basis function regression, domain expertise should be utilized whenever possible in the design of the features. 

\paragraph{Convergence of approximate $Q$-learning.} The combination of off-policy $Q$-learning and function approximation is not, in general, guaranteed to converge. We refer the reader to \cite{sutton2018reinforcement} for a collection of counter-examples. In general, the combination of function approximation, bootstrapping, and off-policy training is referred to as the \textit{deadly triad}, and typically results in instability. 

\subsubsection{Neural Fitted $Q$-learning and DQN}

While we have so far discussed $Q$-learning using linear regression on nonlinear features, we will now move to the setting in which the features are learned. Due to their differentiability, expressiveness, and good empirical performance (especially for image inputs), a common choice is a neural network. This approach is typically referred to as either \textit{neural fitted $Q$ iteration} \cite{riedmiller2005neural}, or as the \textit{deep $Q$ network} (DQN), \cite{mnih2015human}. Given the neural network parameterization, the squared TD error loss function has the gradient
\begin{equation}
    \nabla_{\param} \ell(\param) = - \E_{\st,\ac,\st'\sim \rho}\left[ (\cost + \gamma \min_{\ac' \in \actionspace} Q_{\param}(\st',\ac') - Q_{\param}(\st,\ac)) \nabla_{\param} Q_{\param}(\st,\ac) \right]
\end{equation}
as for standard fitted $Q$-learning. The $Q$ function approximation can, using Monte Carlo approximation of this gradient and backpropagation for neural network optimization, update the parameters of the neural network. 

There are two major non-obvious modifications to standard fitted $Q$-learning that make deep $Q$ networks successful: experience replay \cite{lin1993reinforcement} and lagged $Q$ function updates.

% TODO add algorithm block for DQN

\paragraph{Experience replay and off-policy training.} 
Neural networks trained in a strictly online fashion, taking gradient steps with respect to the most recent observed input-output pair, result in unstable training, forgetting, and possible divergence. To avoid this, we turn to \textit{experience replay}, in which a history of $(\st, \ac, \cost, \st')$ transitions is stored. For training, a minibatch of these transitions is taken from the experience replay buffer, and used to compute a Monte Carlo estimate of the squared TD error, and then used for gradient descent. 

How do we square the logging of old data with the expectation over the distribution induced by the policy? Is this algorithm on-policy or off-policy? In practice, the entries in the replay buffer are biased toward more recent samples. Thus, the approach is neither completely on-policy or off-policy. In practice, design of this replay buffer is based on heuristics.

\paragraph{Lagged $Q$ updates.} We define the \textit{target} $Q$ function as $Q(\st',\ac')$, the inner $Q$ function within the update, $\min_{\ac' \in \actionspace} Q(\st',\ac')$. To stabilize the DQN learning, it is necessary to represent this with a different network from the other $Q$ function that appears in the TD error equation. We will refer to the parameters of this network  as $\param'$. This network is not learned independently; instead, it is set to the weights of the other $Q$ network on a lagged schedule. A typical approach is to set $\param' = \param$ every $k$ steps, where $k$ is some large constant (e.g. 1000). 

% TODO: add discussion of bias in Q function; double Q learning, TD3.

% TODO: add discussion of other approaches to stabilize learning; 
% - reward clipping
% - gradient clipping 

\section{Policy Gradient and Monte Carlo Policy Improvement}

In the previous section we discussed methods that learned $Q$ functions, for which action selection consisted of greedy optimization. In this section, we will discuss methods that directly optimize the policy with respect to total expected cost. These methods do not rely on bootstrapping, unlike $Q$-learning and SARSA. 

We will first discuss Monte Carlo methods in the tabular setting, in which the state and action space are discrete. This discussion will cover both the policy evaluation and the improvement setting. Following this, we will discuss the policy gradient approach, which allows optimization of parameterized policies and will be a key tool in our discussion of actor-critic methods in the next section. 
We assume the initial state at the start of each episode is $\st_0 \sim p(\st_0)$, and that the MDP does not change between episodes. 

\subsection{Monte Carlo Policy Evaluation}

Before considering policy improvement, we will discuss the policy evaluation setting, in which we aim to estimate the value function associated with a fixed policy. We will assume episodic interaction with the environment, and that the value function is not time varying.
% TODO add note discussing finite horizon with non time varying 

The algorithm is presented in Algorithm \ref{alg:MCPE}. Here, $G$ denotes the total return for an episode. The rollouts (trajectories) may be computed via interaction with the true MDP or a simulated system. Algorithm \ref{alg:MCPE} shows \textit{first-visit} evaluation. In this approach, the value for each state is estimated as the average (over episodes) of the sum of costs from the first time a state is visited to the end of the episode. This is in contrast to \textit{every-visit} evaluation, in which the the sum of costs from each time a state is visited is used to estimate the value. 

In addition to Monte Carlo estimation of value functions, we may also perform Monte Carlo estimation of $Q$ functions. This approach will be useful for model-free policy improvement. Monte Carlo $Q$-function policy estimation proceeds identically to value function estimation, but replaces the estimation step 
\begin{equation}
    \hat{\J}(\st_t) \gets G
\end{equation}
with a state-action estimation step
\begin{equation}
    \hat{Q}(\st_t, \ac_t) \gets G
\end{equation}
and similarly, the $Q$ function is the average of the returns across episodes. Note that in the tabular setting, this has a limitation: instead of having to estimate $|\statespace|$ entries, we must now estimate $|\statespace| \times |\actionspace|$ entries. Moreover, for a deterministic policy, many state-action pairs will never be visited. Thus, when used as an inner subroutine in policy improvement, sufficient exploration will be critical. 

\begin{algorithm}[t]
\caption{First-Visit Monte Carlo Policy Evaluation}
\centering
\label{alg:MCPE}

%\begin{minipage}[t]{0.5\textwidth}
\begin{algorithmic}[1]
    \Require Policy $\pol$.
    \State Initialize $\J(\st)$ arbitrarily for all $\st\in \statespace$, empty list $\text{Returns}(\st)$ for all $\st \in \statespace$
    \For{each episode, $n = 0, \ldots, N$}
        \State Generate rollout $\tau = (\st_0, \ac_0, \cost_0, \ldots, \st_T, \ac_T, \cost_T)$
        \State $G \gets 0$
        \For{$t = T, \ldots, 0$}
            \State $G \gets \gamma G + \cost_t$
            % \If{$\st_t \notin \{\st_T, \ldots, \st_{t+1}\}$}
            \State $\hat{\J}(\st_t) \gets G$
            % \EndIf
        \EndFor
        \State Append $\hat{\J}(\st)$ to $\text{Returns}(\st)$ for all $\st \in \statespace$
        \State $\J(\st) \gets \text{average(Returns}(\st))$ for all $\st \in \statespace$
    \EndFor
\end{algorithmic}
%\end{minipage}
\end{algorithm}


\subsection{Monte Carlo Policy Improvement}

Having constructed an approach for Monte Carlo policy evaluation, we may now consider policy improvement. Monte Carlo policy improvement is an instantiation of generalized policy iteration, where the evaluation step is the $Q$ function policy evaluation step described previously. This is interleaved with a greedy policy improvement step, of the form
\begin{equation}
    \pol(\st) \gets \argmin_{\ac\in\actionspace}Q(\st,\ac).
\end{equation}
These two steps are typically iterated between after every episode. That is, the standard pipeline is to first, given a policy, rollout on the environment. Given that rollout, the policy estimation phase may be performed. Following this, policy improvement is performed, and the loop begins again. 

In contrast to standard policy iteration, this approach does not iterate over every state in the policy evaluation step. Thus, effective policy evaluation relies on sufficient exploration. A standard approach is to consider \textit{soft} policies, which have non-zero probability associated with each action, at each state. Additionally, as time progresses, these policies become less soft (i.e. put less probability mass on sub-optimal actions) to ensure convergence. A typical approach is $\epsilon$-greedy, as discussed in the previous section. Given some technical conditions on the decay of $\epsilon$, Monte Carlo policy improvement can be shown to converge to the optimal policy. 

% TODO add proof, discussion of epsilon greedy convergence

\subsection{The Likelihood Ratio Trick and REINFORCE}

We have so far considered direct policy optimization via Monte Carlo methods in the tabular setting. We will now look at the approximate setting, in which we aim to optimize a parameterized policy, which we write $\pol_{\param}$. We will also consider stochastic policies, which we will write as $\pol_{\param}(\ac\mid\st)$, where $\pol_{\param}(\cdot\mid\st)$ is a conditional probability density. We will write the \textit{return} of the policy as a function of the policy parameters as 
\begin{equation}
    G(\param) = \E_{\st_0}[\J^{\pol_{\param}}(\st_0)]
\end{equation}
where we now define the value as 
\begin{equation}
    \J^{\pol}(\st) = \E_{\ac\sim\pol(\cdot\mid\st), \st'\sim p(\cdot\mid\st, \ac)}[\cost(\st,\ac) + \gamma \J^{\pol}(\st')].
\end{equation}

To optimize this policy via gradient descent on the return, we turn to the policy gradient theorem. 

\theorem{Let $G(\param)$ denote the expected return of policy $\pol_{\param}$. Let $ \tau = (\st_0, \ac_0, \ldots, \st_T, \ac_T)$ denote a finite horizon trajectory generated by the dynamics of the environment and the policy, with density $p(\tau \mid \param)$. We will write the total cost associated with the trajectory $\tau$ as $\cost(\tau)$. Then,
\begin{equation}
\nabla_{\param} G(\param) = \E_{\tau \sim p(\cdot\mid \param)}[\cost(\tau) \sum_{t=0}^T \nabla_{\param} \log \pol_{\param}(\ac_t\mid \st_t)]    
\end{equation}
}

\proof{The proof proceeds via simple calculus and rearrangement of terms. We will use the identity 
\begin{equation}
 p_{\param}(\st) \nabla_{\param} \log p_{\param}(\st) = \nabla_{\param} p_{\param}(\st).
\end{equation}

Given this, we have 
\begin{align}
    \nabla_{\param} G(\param) &= \nabla_{\param} \E_{\tau \sim p(\cdot\mid \param)}[ \cost(\tau)] \nonumber\\
    &= \nabla_{\param} \int p(\tau\mid \param) \cost(\tau) d\tau \nonumber\\
    &= \int \cost(\tau ) \nabla_{\param} p(\tau\mid \param) d\tau \nonumber\\
    &= \int \cost(\tau ) p(\tau\mid \param) \nabla_{\param} \log p(\tau\mid \param) d\tau \nonumber\\
    &= \E_{\tau \sim p(\cdot\mid \param)}[ \cost(\tau) \nabla_{\param} \log p(\tau\mid \param)] \label{eq:polgrad1}.
\end{align}
Note that 
\begin{equation*}
    p(\tau\mid \param) = p(\st_0) \prod_{k=0}^{T-1} p(\st_{k+1} \mid \ac_k) \pol_{\param}(\ac_k \mid \st_k),
\end{equation*}
and thus 
\begin{equation*}
    \log p(\tau\mid \param) = \log p(\st_0) + \sum_{k=0}^{T-1} \Large{(} \log p(\st_{k+1} \mid \ac_k) + \log \pol_{\param}(\ac_k \mid \st_k) \Large{)}.
\end{equation*}
Moreover, note that all the terms in the above are independent of $\param$, with the exception of the policy. Thus, we have $\nabla_{\param} \log p(\tau\mid \param) = \sum_{k=0}^{T-1} \nabla_{\param} \log \pol_{\param}(\ac_k \mid \st_k)$. Plugging this into \eqref{eq:polgrad1} completes the proof.
}

By using Monte Carlo estimation of this gradient from experience (as opposed to Monte Carlo policy evaluation as in the last subsection), we can take (noisy) gradient descent steps on the total return to optimize the policy from experience. Indeed, note that on-policy behavior (i.e. acting in the real environment with actions drawn from $\pol_{\param}$ provides samples from $p(\tau\mid \param)$.

% algorithm block

% specific instantiation with gaussian policy 

\section{Actor-Critic Methods}

% general policy gradient theorem

% A2C, A3C, other RL algorithms

\subsection{Variance Reduction}

\subsubsection{Importance Sampling}

\subsubsection{Control Variates}

\section{Exploration}

\section{Bibliographic Notes}